{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "g_a9QvUFVCUR"
   },
   "source": [
    "<h1>Chapter 2 - Tokens and Token Embeddings</h1>\n",
    "<i>Exploring tokens and embeddings as an integral part of building LLMs</i>\n",
    "\n",
    "\n",
    "<a href=\"https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961\"><img src=\"https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon\"></a>\n",
    "<a href=\"https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/\"><img src=\"https://img.shields.io/badge/O'Reilly-white.svg?logo=\"></a>\n",
    "<a href=\"https://github.com/HandsOnLLM/Hands-On-Large-Language-Models\"><img src=\"https://img.shields.io/badge/GitHub%20Repository-black?logo=github\"></a>\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb)\n",
    "\n",
    "---\n",
    "\n",
    "This notebook is for Chapter 2 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).\n",
    "\n",
    "---\n",
    "\n",
    "<a href=\"https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961\">\n",
    "<img src=\"https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png\" width=\"350\"/></a>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### [OPTIONAL] - Installing Packages on <img src=\"https://colab.google/static/images/icons/colab.png\" width=100>\n",
    "\n",
    "If you are running this notebook on Google Colab (or another cloud platform), **uncomment and run** the following code block to install the dependencies for this chapter:\n",
    "\n",
    "---\n",
    "\n",
    "💡 **NOTE**: The examples in this notebook are written to run on a GPU. In Google Colab, go to\n",
    "**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%capture\n",
    "# !pip install --upgrade transformers==4.41.2 sentence-transformers==3.0.1 gensim==4.3.2 scikit-learn==1.5.0 accelerate==0.31.0 peft==0.11.1 scipy==1.10.1 numpy==1.26.4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oQHfpqT_t9-K"
   },
   "source": [
    "# Downloading and Running an LLM\n",
    "\n",
    "The first step is to load our model onto the GPU for faster inference. Note that we load the model and the tokenizer separately, and keep them in separate variables, so that we can explore each one on its own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 753,
     "referenced_widgets": [
      "851b6e59cc2e4eb8961cb5fa4906c47c",
      "fd5b6ec0a82c493a92a2235635ea5ac0",
      "2bc5005713ba4b61a392935e1d83994c",
      "607bb5b3f27d463dbdb5ad04583a1c4f",
      "0d0c6c47fbb34090a73ed8bf20597ee5",
      "083149fa68934cce90cd03b6c8f90dd4",
      "73263751695843ada347c6c694afd0cd",
      "79c1152a61dd4a87a6a4e700096d92b4",
      "27fb45533f134c308cc0c27797c127f3",
      "78eb9f6c02e841ad9ecf8ed809c724bb",
      "1ade569c5cec4764af86e7eec6bc64ed",
      "dbf8268e613e4e1498133a42a48a58f8",
      "66c3d79c9411447eabae228b5d125c09",
      "46ba964cc4c84bbb87eeccc817a3354c",
      "150625fb48b740f6bce66fd4919357a9",
      "974be84551a54deb82051b947ff013af",
      "22e2cbe577e6475aa80dd94f4b704f1b",
      "83daaa4a1a5d437e8345d4740c3625cd",
      "6e1c0e41b31b4b0b83aa3736e386aa49",
      "bf1d39ed8ee84a6b95e68a11962032b4",
      "7476eb73c9544435b07256f028168a11",
      "f33bc31f0cb842388057b33dd107f2e7",
      "9b9f8a8f0eb14dd9a2c82461f22c4636",
      "214eb5ac9a5048c586b3418452017991",
      "dd1b7306ad4641799d4947a94aa0f088",
      "d58e84330cff499c87e03827d9727d19",
      "2809ac7769af482a86d5e763f5f69211",
      "746054d4ac0b442e85b75959b66ceb34",
      "9b77017c3e764c97b44c56fb5ad3bc77",
      "3dc073c6257143748849311e1c08bdbe",
      "3117fa7446394cb6b6b56427be0a3290",
      "5c3c4612f65f40808f8ef1d0e31f9836",
      "f66983763d454728b86d881621948b8d",
      "a855727c923648ea8d6c7b9ec117c94f",
      "6c9350120fec4f4e9b80e063279b459c",
      "28283e52317e437c8e118d052d93419c",
      "5556b8b2f22548109153146ef66702df",
      "dc1457b4d1d344c9863fbb6b0d38e2cf",
      "029a23c145604f209b12044a6b367802",
      "252ccf08bd52433a888d2bac9ed1b64c",
      "893af8dae5874b55a582f180652e36e3",
      "27e14bb1e4aa467c97ac0c6a6eb44108",
      "87812873ff2d4514aca02c64a3509bd9",
      "13ecb25d0d6541b1a4f2b7dea92dee61",
      "2107a7e7d40f416ca4bc46c90725e0ca",
      "1cfc631c1d7d4fb087753e9e34fd1aa2",
      "238f094ef6744b31ba4444d47c17558c",
      "b11bf972b98645e79f98cdec2c1440f3",
      "baa7c53a02cd42688d768a4595611f64",
      "6a63559431934a4aae4ce79e8b2e76c0",
      "81652b39152145748580c9667b2554de",
      "4dc8d9b1a2d248f9a30bc8e985db4568",
      "adfcbd50473c4ea9bb414e33926cc33b",
      "52a9a5aed89a4294882c8c55b2078b84",
      "83fb238d4eba47a390233ecb5e870ed4",
      "8da9bab6ee214504a187e5b3c9bf1b80",
      "77922a918c1b4bb08f5304a34044e9ba",
      "ac36ee2d4df04399b11f57bb56930a52",
      "df2f73dc04d64bf6adb4df0eae207c27",
      "3cbc759c16b24d63971ba5ef53ff9ba3",
      "ea5d84bf35754e71a628755c0cff49be",
      "3ed30ef053324b3f9a5d3258d2a88a86",
      "f8bae10ac5b54f1b99f7b595f123be4d",
      "dfb5b5a357ca41f88efa53288ebeb5b3",
      "b917827c07344e018543fcbb9c7a6fe8",
      "bad199037f6548749f4dfe12724e3e62",
      "0e30dc616d9c4c50adb9cd292fc3d89e",
      "570a9d836e48456d8c301184d670c50e",
      "471cdd69adc24003b14680e8dbe7b183",
      "c11552c4f423434aa2c41dfa22b8f227",
      "687c61c78b044690a91e34affe570613",
      "82dcc51143ac46349c13d2f5bee380da",
      "a9c8a9b099aa416d9015ee48006db324",
      "1fcc7e4cd65449c49956981ec1b46843",
      "5fc8310895a64815b15af53e3583f60d",
      "ffa0950ea21a4689b5804b0285049ccd",
      "da9ec5fd7acd43b7a11ac860771732df",
      "961d4e8f3a1845e393e25d21de3cd6d3",
      "e7506426cc424dfdb28fa6c35cb74c24",
      "d9b26784fc7c4e6b8abe9490caf53afc",
      "e96d58589fcc4bd7a469c4281baa01ee",
      "dd8fa47d73774a7684cda73e6675c0bd",
      "41843fb3c92c425b9f39cb124b74e3ad",
      "2ab54f26cc504dc09db592039c02c210",
      "cab531e67dd3452f9a4fcd17e324d76a",
      "58de29a54c91401d92cc73a96721a3e9",
      "0eabbb5ac0784a1c89ad58cca37a38fe",
      "362724ce1c5944bd9335b66a14f89844",
      "bd2749d8ba7240408be56e26c796e00a",
      "b54e7ab6e22345bdae3484cab6dab985",
      "77d3a6cc940a4dd7a38c47dfc2b7041b",
      "13071c70f17144b891a1f0fde6d9b21e",
      "205628f7d62d4e54add8ee90b418ebdf",
      "037c74818af74266b2bce5a454b99ab8",
      "bba33c0c581d415bba86aebfc8642196",
      "294ee2f27b4c4ee1ad6ca2a121a9e66d",
      "5bdf1b7758624e379c71da5296929973",
      "6cae5555dce346dc9867edd130c840c9",
      "0e37d9ed7d3142a3b80f188ddee54a4d",
      "d077b2c0e6a142cb905e38a7c7855997",
      "2a47328eace84efbb8fcf6644d4e469d",
      "c1634b24795243f5b8384957ce11b9c0",
      "b8285e7e6db348c385694d2fd63b514e",
      "da9d77e8fa0641dc97f7610aa59374a2",
      "7209a02303ad46f3880799f0b92171f0",
      "770f06c5ff714c87af1d0d7536da12c9",
      "f59c537df5684195b1b5c44754ed2d07",
      "6dd8dd0a99bd4cbbb7401a3fa3893128",
      "f8c4e2a7f11f47fcbd574ceaf828aa66",
      "193e7cde826d43bd894362498a217888",
      "febedf9dd6d248baa984ac215697968d",
      "0bf43b40190141d5bf8e0be3fdb1e529",
      "4070e2510e1549c3879e2458e42bb269",
      "9ba44d309f684c98991889b992767ab3",
      "b47b04d69b33402f80ea75ed472d3c29",
      "19a091b24a5d44b9a98569a7e124c6f8",
      "0d1cea03253c426b84814936c93e5279",
      "7c1b3811b42549569cc83423bbeb2163",
      "55bc2055b3c5426f8ba9afc9f80ccd58",
      "61098d22dd294891afdd34b6208906ba",
      "0a0db951df42424084b531ebbd6cfa98",
      "3f1dc764fb2c48bd9fd8797a86b6c59f",
      "95146eb05e964e32bfa6f5078a66fab7",
      "8c7e09509c524cf29782df69988eef6d",
      "5237587cfb9c4e86a82aaf72cc9db39d",
      "22fd14ac3ccf4e1d8e4782d35e7ec91c",
      "138a3cb6ba5b494c8a237af0c6306a4f",
      "e88a1d50d7f940d793f4dcd76a5f5929",
      "4182011b06304005a6e8922d338c65fb",
      "16ae1f6d9d844d60b55da9e36a5792cd",
      "56ea52f3a1ea485ca52eebaf4f97c3a3",
      "c81c6fe3a8ed4466bf95c53dfe415fba",
      "157b7540b5a3404e82d7e86b19906c7a",
      "48d180eca04341d692d5fb83a8e0f76d",
      "c84b966b98144e22884183cd2a03e2a8",
      "9a6496f9037149289e64e66a0f46cbd3",
      "ecd15644266b43678dbe7b067b553833",
      "43a6de1c960341eda9e154d47e0142c9",
      "6e8fa3e12a8b4262bbca73c57c7f4eb9",
      "86445e2d62b44847a4be7419a6000305",
      "3837338134c240caa0c51501fae402fb",
      "5e2d4a77710e428ebef6bae0854c8527",
      "63d81ace73634bcdaef6f5a103c35dcf",
      "dd1af9490bc24d5ea2804adefd442fdb",
      "cf18ae8dd5bb4d378f017f9eaff967b4",
      "461fde63ac0a41ed8ad9aab2d60769a7",
      "3c5bb3c7c77f4ca393dc05b12fb69dd0",
      "bd707eacc9914dc69a1fe4e35320a9a4",
      "d7cb5720fbbb4ae0858dd3070c712615",
      "bf390e3a7172415089b2cce2223dd8b6",
      "72ae6f3dcf3d4567a4a60f627a51f1f4",
      "026141ba5e404e4e86c1d1c083ab186d",
      "fcd5f49ec07f401fb8a2b403516bbd17",
      "22d378959eb941c5bbfe77702fb426e6"
     ]
    },
    "executionInfo": {
     "elapsed": 95520,
     "status": "ok",
     "timestamp": 1723034396041,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": -60
    },
    "id": "jjU8NBHnwA4j",
    "outputId": "286bdccb-f25d-4b0e-bda3-44d2a4be45cd"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.\n",
      "Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "dd45dd8837f94b38ae6f4ffd205d9ea6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
     ]
    }
   ],
   "source": [
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "# Load model and tokenizer\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    \"microsoft/Phi-3-mini-4k-instruct\",\n",
    "    device_map=\"cuda\",\n",
    "    torch_dtype=\"auto\",\n",
    "    trust_remote_code=False,\n",
    ")\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"microsoft/Phi-3-mini-4k-instruct\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 5750,
     "status": "ok",
     "timestamp": 1719641447389,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "_iVl5yePuq3B",
    "outputId": "4ce629bf-3897-4ab0-8cf1-8f55e2040155"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:transformers_modules.microsoft.Phi-3-mini-4k-instruct.ff07dc01615f8113924aed013115ab2abd32115b.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<s> Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: My Sincere Apologies for the Gardening Mishap\n",
      "\n",
      "Dear\n"
     ]
    }
   ],
   "source": [
    "prompt = \"Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>\"\n",
    "\n",
    "# Tokenize the input prompt\n",
    "input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids.to(\"cuda\")\n",
    "\n",
    "# Generate the text\n",
    "generation_output = model.generate(\n",
    "    input_ids=input_ids,\n",
    "    max_new_tokens=20\n",
    ")\n",
    "\n",
    "# Print the output\n",
    "print(tokenizer.decode(generation_output[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 4,
     "status": "ok",
     "timestamp": 1719641447389,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "JmzgbbdKuvHt",
    "outputId": "82511d5b-7949-49a0-e3a6-c128564575c8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[    1, 14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278,\n",
      "         25305,   293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,\n",
      "           920,   372,  9559, 29889, 32001]], device='cuda:0')\n"
     ]
    }
   ],
   "source": [
    "print(input_ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 3,
     "status": "ok",
     "timestamp": 1719641447389,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "W4vsjbxwu1K1",
    "outputId": "506f32d1-f058-4cfd-a9cd-13c4dabe80e6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<s>\n",
      "Write\n",
      "an\n",
      "email\n",
      "apolog\n",
      "izing\n",
      "to\n",
      "Sarah\n",
      "for\n",
      "the\n",
      "trag\n",
      "ic\n",
      "garden\n",
      "ing\n",
      "m\n",
      "ish\n",
      "ap\n",
      ".\n",
      "Exp\n",
      "lain\n",
      "how\n",
      "it\n",
      "happened\n",
      ".\n",
      "<|assistant|>\n"
     ]
    }
   ],
   "source": [
    "# Decode each token ID on its own to see how the prompt was split into tokens\n",
    "for token_id in input_ids[0]:\n",
    "    print(tokenizer.decode(token_id))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 3,
     "status": "ok",
     "timestamp": 1719641447389,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "A9wRZ3J3u4z1",
    "outputId": "7efaa49c-7a5a-41d7-f000-7aace16007e5"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[    1, 14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278,\n",
       "         25305,   293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,\n",
       "           920,   372,  9559, 29889, 32001,  3323,   622, 29901,  1619,   317,\n",
       "          3742,   406,  6225, 11763,   363,   278, 19906,   292,   341,   728,\n",
       "           481,    13,    13, 29928,   799]], device='cuda:0')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generation_output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 275,
     "status": "ok",
     "timestamp": 1723034447362,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": -60
    },
    "id": "7QlHLof3u8A3",
    "outputId": "c2315e1b-91b4-4a1b-9bcc-084f16ac8db1"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sub\n",
      "ject\n",
      "Subject\n",
      ":\n"
     ]
    }
   ],
   "source": [
    "print(tokenizer.decode(3323))\n",
    "print(tokenizer.decode(622))\n",
    "print(tokenizer.decode([3323, 622]))\n",
    "print(tokenizer.decode(29901))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "T9nRducW48bd"
   },
   "source": [
    "# Comparing Trained LLM Tokenizers\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7W0xFIVo5A0S"
   },
   "outputs": [],
   "source": [
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "# RGB background colors, cycled through so that adjacent tokens are easy to tell apart\n",
    "colors_list = [\n",
    "    '102;194;165', '252;141;98', '141;160;203',\n",
    "    '231;138;195', '166;216;84', '255;217;47'\n",
    "]\n",
    "\n",
    "def show_tokens(sentence, tokenizer_name):\n",
    "    # Load the tokenizer by its Hugging Face Hub name and encode the sentence\n",
    "    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)\n",
    "    token_ids = tokenizer(sentence).input_ids\n",
    "    # Print each decoded token on a colored background (ANSI 24-bit color escape codes)\n",
    "    for idx, t in enumerate(token_ids):\n",
    "        print(\n",
    "            f'\\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +\n",
    "            tokenizer.decode(t) +\n",
    "            '\\x1b[0m',\n",
    "            end=' '\n",
    "        )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Gcc3JjwX5DK-"
   },
   "outputs": [],
   "source": [
    "text = \"\"\"\n",
    "English and CAPITALIZATION\n",
    "🎵 鸟\n",
    "show_tokens False None elif == >= else: two tabs:\"    \" Three tabs: \"       \"\n",
    "12.0*50=600\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 354,
     "status": "ok",
     "timestamp": 1725544666773,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": 240
    },
    "id": "fCDGSXP75Hv-",
    "outputId": "f2c26835-a857-41db-ff2d-d930d06e512e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165m[CLS]\u001b[0m \u001b[0;30;48;2;252;141;98menglish\u001b[0m \u001b[0;30;48;2;141;160;203mand\u001b[0m \u001b[0;30;48;2;231;138;195mcapital\u001b[0m \u001b[0;30;48;2;166;216;84m##ization\u001b[0m \u001b[0;30;48;2;255;217;47m[UNK]\u001b[0m \u001b[0;30;48;2;102;194;165m[UNK]\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_\u001b[0m \u001b[0;30;48;2;231;138;195mtoken\u001b[0m \u001b[0;30;48;2;166;216;84m##s\u001b[0m \u001b[0;30;48;2;255;217;47mfalse\u001b[0m \u001b[0;30;48;2;102;194;165mnone\u001b[0m \u001b[0;30;48;2;252;141;98meli\u001b[0m \u001b[0;30;48;2;141;160;203m##f\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m>\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98melse\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195mtwo\u001b[0m \u001b[0;30;48;2;166;216;84mtab\u001b[0m \u001b[0;30;48;2;255;217;47m##s\u001b[0m \u001b[0;30;48;2;102;194;165m:\u001b[0m \u001b[0;30;48;2;252;141;98m\"\u001b[0m \u001b[0;30;48;2;141;160;203m/\u001b[0m \u001b[0;30;48;2;231;138;195mt\u001b[0m \u001b[0;30;48;2;166;216;84m/\u001b[0m \u001b[0;30;48;2;255;217;47mt\u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98mthree\u001b[0m \u001b[0;30;48;2;141;160;203mtab\u001b[0m \u001b[0;30;48;2;231;138;195m##s\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m12\u001b[0m \u001b[0;30;48;2;141;160;203m.\u001b[0m \u001b[0;30;48;2;231;138;195m0\u001b[0m \u001b[0;30;48;2;166;216;84m*\u001b[0m \u001b[0;30;48;2;255;217;47m50\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m600\u001b[0m \u001b[0;30;48;2;141;160;203m[SEP]\u001b[0m "
     ]
    }
   ],
   "source": [
    "show_tokens(text, \"bert-base-uncased\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 219,
     "referenced_widgets": [
      "76ff072348d0471abaa566d7de6b8e93",
      "5323bca1afe64a2a9bc9c14ac39ad230",
      "bf9fbc21d76a424dab0c8687bd0874c9",
      "f67d099d1e0343d9a822523223d21a75",
      "1dc884bbe68e4bf4ad1e14ca44b5c2fd",
      "b57a5357a7334d47b39f14e07e2d5708",
      "e608cc17958347c19615577b73926f45",
      "1695867ecb3f402ea8e10920adf651ee",
      "9fb081ecc7374f73b32cb203c7ed042d",
      "2fdb8c6afd2647b695f249b4f8122e52",
      "6e12c295923644ab930dcb51de932fd9",
      "d3316f8df2804c2ba34504b196aed6be",
      "893c6036041346ab9486cb7bde06ad0d",
      "42f7a33f1b554c289a434300bcc19f70",
      "d75a253a2ef648e88e35d765be5d4c34",
      "3b5a28c840ee4ba78676b4c3dbcd7af6",
      "e1175aef0e2e4b17afdb6a62f68887a7",
      "8774144d2cbc44f4bb39305e20cd5093",
      "139f7c4f547f4d08a6126897617989d5",
      "61bafb93125042f5bb7cc1195b459d45",
      "8053c516a8df40498085843ff07a2884",
      "81aa45723ec44d18a0e37844b9c70c4e",
      "33541a7c0d664fa2bd104fc9bc91f1bd",
      "580f25a7f04943e6a5100bdf584f8c97",
      "1848a0b868254c848115fc59b2cdd639",
      "47e387c2bcdc43329354304e5358c224",
      "883b562fa1c7409fbf43d2ac90c29955",
      "9798a6e28f56466f9582a40064ad4c4f",
      "4f31fd12d7c04233856206304b2a1bc7",
      "1460bc4aca764def9120218a963ae183",
      "3933809fb01c43bca62dc220cc94f217",
      "e4caf964309345bf82ad65411c1a7f3f",
      "ea6b6957c38d469abbe07bb20c811d2d",
      "eabd5498553646f18eada3542254cb0b",
      "206f377162614c72a74e73557fece973",
      "deabe5ef1f1f473f81403df9d8923846",
      "4096283e990a47a7bd4c00416fa71788",
      "3598675602e34600ae5c719d67778d24",
      "9ca5334888b249b2bd318be5a97fae7e",
      "7cce748a6f264cb69921fc97d4b8c946",
      "449aaf8bc9a64f6483eff88cc4678f6c",
      "e983503abea646a68dfe45652f7e78d1",
      "06b9fe5f3e644ddab47a17a3276e1a67",
      "4128b1818d30497f9a3d6869e02addaa"
     ]
    },
    "executionInfo": {
     "elapsed": 1520,
     "status": "ok",
     "timestamp": 1719589575187,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": 240
    },
    "id": "0Ay_NX3K5HyP",
    "outputId": "4a32ab93-75f2-4b70-a55b-b643283c8270"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "76ff072348d0471abaa566d7de6b8e93",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d3316f8df2804c2ba34504b196aed6be",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "33541a7c0d664fa2bd104fc9bc91f1bd",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "eabd5498553646f18eada3542254cb0b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165m[CLS]\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203mand\u001b[0m \u001b[0;30;48;2;231;138;195mCA\u001b[0m \u001b[0;30;48;2;166;216;84m##PI\u001b[0m \u001b[0;30;48;2;255;217;47m##TA\u001b[0m \u001b[0;30;48;2;102;194;165m##L\u001b[0m \u001b[0;30;48;2;252;141;98m##I\u001b[0m \u001b[0;30;48;2;141;160;203m##Z\u001b[0m \u001b[0;30;48;2;231;138;195m##AT\u001b[0m \u001b[0;30;48;2;166;216;84m##ION\u001b[0m \u001b[0;30;48;2;255;217;47m[UNK]\u001b[0m \u001b[0;30;48;2;102;194;165m[UNK]\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_\u001b[0m \u001b[0;30;48;2;231;138;195mtoken\u001b[0m \u001b[0;30;48;2;166;216;84m##s\u001b[0m \u001b[0;30;48;2;255;217;47mF\u001b[0m \u001b[0;30;48;2;102;194;165m##als\u001b[0m \u001b[0;30;48;2;252;141;98m##e\u001b[0m \u001b[0;30;48;2;141;160;203mNone\u001b[0m \u001b[0;30;48;2;231;138;195mel\u001b[0m \u001b[0;30;48;2;166;216;84m##if\u001b[0m \u001b[0;30;48;2;255;217;47m=\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m>\u001b[0m \u001b[0;30;48;2;141;160;203m=\u001b[0m \u001b[0;30;48;2;231;138;195melse\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47mtwo\u001b[0m \u001b[0;30;48;2;102;194;165mta\u001b[0m \u001b[0;30;48;2;252;141;98m##bs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m\"\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47mThree\u001b[0m \u001b[0;30;48;2;102;194;165mta\u001b[0m \u001b[0;30;48;2;252;141;98m##bs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m\"\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m12\u001b[0m \u001b[0;30;48;2;102;194;165m.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m[SEP]\u001b[0m "
     ]
    }
   ],
   "source": [
    "show_tokens(text, \"bert-base-cased\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 284,
     "referenced_widgets": [
      "77d1608ca87f4bc1b7a731c896d86db9",
      "6e2c83d85af0419c81df10e85e31d29d",
      "5d3e86f8d3f949aeacabace1f7640d81",
      "6fd2baf1fc1244d38fc41cde100d7b6e",
      "41965c378ba243339352ad3926d48862",
      "873384e3c478450ea4a5a9e061c87133",
      "dcce05965068457fb764ae1c04066d88",
      "41f2087fd27a4c018087063e8e7629d3",
      "905ac86fe3294497b1722e955d63ed4c",
      "131c47105d324eddaea5c241829e878a",
      "22f955573c2845cfba5c314b59d26739",
      "41301a0547754ecfbd6044f6eacb0b8d",
      "eff43edf235a4b92b0e26d5bc21fc909",
      "bbdc2b4c70a0426aa3d2043d0b91b839",
      "405476ddfe634ad793e28474dfe30ecc",
      "8575a84785714069921bbfdc13fb957e",
      "bad64205077a496f96e6d03d927140ba",
      "ca152f8ec99e48b39ee5267269eeaca0",
      "461e3c04697641359924ae0902b13db0",
      "f15e240b2d01488698caa3275e0bacc1",
      "4e2491afa9fc4d65b95e9471af782e4d",
      "ff1f8b630ceb449e910ca34d969fdafd",
      "fa60038dc7c547b8b1d9c54f88fd6b39",
      "e69fe73aa3d44de0bddbe1711269bd8e",
      "5ff700af61664f0eaebe580c8a49a910",
      "e901ed75738f4e41b79651dc012003a6",
      "b115c5c5193f489f87209bb5c6d788f9",
      "3bce3be7198c4917a6cb2183e1344e2c",
      "dedac5ecd5844ee29e346f465074d3fd",
      "4521cc909b2942de889a37dbec1f0277",
      "645a2e3dba2f49fb9ce9c0e2b2a8e73f",
      "198833f7fa2f4ff8ab064b4671461830",
      "66acf835e274473f84ccd486a99e71d2",
      "db2934af14274fe78ffc85f7d03fd1c8",
      "9be5d1e096934134a00974cf8e3fa63c",
      "fc2c07f1eeee43e3aad438206929f5df",
      "08d6a11cebf840748261e0ba6970092b",
      "f8912b0da8aa4f7499ad3f4e5ccca84b",
      "c87ad6d49a054fc8850bebf87c444ee3",
      "17b6d632e333476b99f4315fe737d359",
      "2661e810b7084f93a4dcd454ea7665a0",
      "042557f8b84b422882c651f910a9fce0",
      "8f95639d6dc946f18f14d8a16e73b4a4",
      "89224986b13645d9a9dbedb038e795bb",
      "2fcd1b9c380f422291096e57a6c7f85e",
      "142133a41f664fdf82a1b16d87a68ae5",
      "d1c2b3aac5cc4f3fad1413f8cfdc04e3",
      "bce267a75b9946ae8b0db42dd7f925d1",
      "3b46d5e1b7fe4427909d1c82debd7ba7",
      "3dc1a66c56fb428aad53f5221ed1ae18",
      "23cda173696645b6955515990b6834ec",
      "2141c20003154d8dab2855deb44d3aad",
      "2213d4aa55eb4ef384eaf879552dcd7d",
      "56009a7bf6fc4d7c8300fc9dc4d6ad14",
      "551cdd6ea1d94ba8a6cce7b00798c63b"
     ]
    },
    "executionInfo": {
     "elapsed": 2010,
     "status": "ok",
     "timestamp": 1719589579935,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": 240
    },
    "id": "K_k5QduY5H0u",
    "outputId": "2e844f23-3dee-4078-8d51-4c250d2c2f3e"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "77d1608ca87f4bc1b7a731c896d86db9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "41301a0547754ecfbd6044f6eacb0b8d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fa60038dc7c547b8b1d9c54f88fd6b39",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "db2934af14274fe78ffc85f7d03fd1c8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2fcd1b9c380f422291096e57a6c7f85e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165m\n",
      "\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAP\u001b[0m \u001b[0;30;48;2;166;216;84mITAL\u001b[0m \u001b[0;30;48;2;255;217;47mIZ\u001b[0m \u001b[0;30;48;2;102;194;165mATION\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
      "\u001b[0m \u001b[0;30;48;2;141;160;203m�\u001b[0m \u001b[0;30;48;2;231;138;195m�\u001b[0m \u001b[0;30;48;2;166;216;84m�\u001b[0m \u001b[0;30;48;2;255;217;47m �\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m\n",
      "\u001b[0m \u001b[0;30;48;2;231;138;195mshow\u001b[0m \u001b[0;30;48;2;166;216;84m_\u001b[0m \u001b[0;30;48;2;255;217;47mt\u001b[0m \u001b[0;30;48;2;102;194;165mok\u001b[0m \u001b[0;30;48;2;252;141;98mens\u001b[0m \u001b[0;30;48;2;141;160;203m False\u001b[0m \u001b[0;30;48;2;231;138;195m None\u001b[0m \u001b[0;30;48;2;166;216;84m el\u001b[0m \u001b[0;30;48;2;255;217;47mif\u001b[0m \u001b[0;30;48;2;102;194;165m ==\u001b[0m \u001b[0;30;48;2;252;141;98m >=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\"\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m \"\u001b[0m \u001b[0;30;48;2;255;217;47m Three\u001b[0m \u001b[0;30;48;2;102;194;165m tabs\u001b[0m \u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203m \"\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m \u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m \u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m\n",
      "\u001b[0m \u001b[0;30;48;2;255;217;47m12\u001b[0m \u001b[0;30;48;2;102;194;165m.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m\n",
      "\u001b[0m "
     ]
    }
   ],
   "source": [
    "show_tokens(text, \"gpt2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 183,
     "referenced_widgets": [
      "63772475e9234672994f2a8edf89b192",
      "fa45fdb364444208b2693760809e3c60",
      "5a92b1072a834d69ad33060ae1b0fdf0",
      "b8a3d22de4964f369ecabc31b6cdca57",
      "274983c981654d1eb25d630d8d5e47e3",
      "eebaba8a83ed48f8afe2ca129f3a73c9",
      "2d7a3c68a2e246059a93fea280f4c2c8",
      "22983afeb04d4a06b8b01527d869585e",
      "f7492077d2ba4ccf8df039473057321d",
      "fb255528c7754047aeb29863dd642b19",
      "cc7f7f3ef40042458ac7565b070af032",
      "bc67bb86f76e482ab703f3f403f4cc76",
      "a7605351b83941698348ee84cd99f955",
      "88af753262344cfe9a88b133540bcaf7",
      "8f92316b17a24623b8646747f5ecc7d6",
      "8153c7e3f21f44f09a3222da6312137d",
      "22bee52a95d243868b40b3f8ce5ca7d9",
      "f09f8925fc34454dbb970c31d5d82707",
      "354c2db5dbd34284baf62a7529537b8b",
      "028db37ce29c45939adeed5ae311583c",
      "dd534b3f89c64de9b9aa4f7a95a05f34",
      "30ed305df62c45329f24a2f64499d490",
      "36e3e6b45fca44c8b01a729189b1bdab",
      "4727da03ed724b8dafff24856652fe95",
      "b580d17fa1134425a837a48aba06dbf4",
      "e71172ffd8d74f189cea18e4898c4c2d",
      "ce0031117cf347f48d027cd70e87193b",
      "9ee79c669c6641d887fa284286435f57",
      "9885aaa3e0be4052af4848b92c642cdd",
      "ce6eee6121334479a72e023b252124d2",
      "3adda22d2bca47a886da662621ec9a9d",
      "6a1a34996ae24a14972fd425be48dd7c",
      "2c323975d4454113a863e3ec0b56f4fb",
      "4b9fab6416924d509bbb6361f63797e1",
      "a721b2fba975474c8a3d9384cf998228",
      "1f76260c0f6c46598bee13eb4a0f8b65",
      "fd95f752cc684613b0f6c6db12af874f",
      "3c0a4cbec1bf4c7886a1a9271c1b0832",
      "2489776fa74e4dccb4154368f3861623",
      "c8506a9393604d119ff71264e27b6734",
      "93187864ee6d4d41b49df1c84f35e6e2",
      "d7cbd635afec493ebcc4973f8d98c58b",
      "d09ad050869c47fb9fbc558f8b6d47d7",
      "b4e35a7edeea4b089182f4e9b15dc12e"
     ]
    },
    "executionInfo": {
     "elapsed": 1618,
     "status": "ok",
     "timestamp": 1719589589160,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": 240
    },
    "id": "EJn5nf3c5H2_",
    "outputId": "607c38ff-9425-4371-f5e0-1f8ee9449eee"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "63772475e9234672994f2a8edf89b192",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bc67bb86f76e482ab703f3f403f4cc76",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "36e3e6b45fca44c8b01a729189b1bdab",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4b9fab6416924d509bbb6361f63797e1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165mEnglish\u001b[0m \u001b[0;30;48;2;252;141;98mand\u001b[0m \u001b[0;30;48;2;141;160;203mCA\u001b[0m \u001b[0;30;48;2;231;138;195mPI\u001b[0m \u001b[0;30;48;2;166;216;84mTAL\u001b[0m \u001b[0;30;48;2;255;217;47mIZ\u001b[0m \u001b[0;30;48;2;102;194;165mATION\u001b[0m \u001b[0;30;48;2;252;141;98m\u001b[0m \u001b[0;30;48;2;141;160;203m<unk>\u001b[0m \u001b[0;30;48;2;231;138;195m\u001b[0m \u001b[0;30;48;2;166;216;84m<unk>\u001b[0m \u001b[0;30;48;2;255;217;47mshow\u001b[0m \u001b[0;30;48;2;102;194;165m_\u001b[0m \u001b[0;30;48;2;252;141;98mto\u001b[0m \u001b[0;30;48;2;141;160;203mken\u001b[0m \u001b[0;30;48;2;231;138;195ms\u001b[0m \u001b[0;30;48;2;166;216;84mFal\u001b[0m \u001b[0;30;48;2;255;217;47ms\u001b[0m \u001b[0;30;48;2;102;194;165me\u001b[0m \u001b[0;30;48;2;252;141;98mNone\u001b[0m \u001b[0;30;48;2;141;160;203m\u001b[0m \u001b[0;30;48;2;231;138;195me\u001b[0m \u001b[0;30;48;2;166;216;84ml\u001b[0m \u001b[0;30;48;2;255;217;47mif\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m=\u001b[0m \u001b[0;30;48;2;141;160;203m>\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84melse\u001b[0m \u001b[0;30;48;2;255;217;47m:\u001b[0m \u001b[0;30;48;2;102;194;165mtwo\u001b[0m \u001b[0;30;48;2;252;141;98mtab\u001b[0m \u001b[0;30;48;2;141;160;203ms\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165mThree\u001b[0m \u001b[0;30;48;2;252;141;98mtab\u001b[0m \u001b[0;30;48;2;141;160;203ms\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165m12.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m\u001b[0m 
\u001b[0;30;48;2;252;141;98m</s>\u001b[0m "
     ]
    }
   ],
   "source": [
    "show_tokens(text, \"google/flan-t5-small\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 714,
     "status": "ok",
     "timestamp": 1723035784494,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": -60
    },
    "id": "1ymhAsTg5H5e",
    "outputId": "7827a535-4f33-4620-f4e7-4a2b622a78c2"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165m\n",
      "\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAPITAL\u001b[0m \u001b[0;30;48;2;166;216;84mIZATION\u001b[0m \u001b[0;30;48;2;255;217;47m\n",
      "\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m�\u001b[0m \u001b[0;30;48;2;231;138;195m �\u001b[0m \u001b[0;30;48;2;166;216;84m�\u001b[0m \u001b[0;30;48;2;255;217;47m�\u001b[0m \u001b[0;30;48;2;102;194;165m\n",
      "\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_tokens\u001b[0m \u001b[0;30;48;2;231;138;195m False\u001b[0m \u001b[0;30;48;2;166;216;84m None\u001b[0m \u001b[0;30;48;2;255;217;47m elif\u001b[0m \u001b[0;30;48;2;102;194;165m ==\u001b[0m \u001b[0;30;48;2;252;141;98m >=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\"\u001b[0m \u001b[0;30;48;2;252;141;98m   \u001b[0m \u001b[0;30;48;2;141;160;203m \"\u001b[0m \u001b[0;30;48;2;231;138;195m Three\u001b[0m \u001b[0;30;48;2;166;216;84m tabs\u001b[0m \u001b[0;30;48;2;255;217;47m:\u001b[0m \u001b[0;30;48;2;102;194;165m \"\u001b[0m \u001b[0;30;48;2;252;141;98m      \u001b[0m \u001b[0;30;48;2;141;160;203m \"\n",
      "\u001b[0m \u001b[0;30;48;2;231;138;195m12\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m50\u001b[0m \u001b[0;30;48;2;141;160;203m=\u001b[0m \u001b[0;30;48;2;231;138;195m600\u001b[0m \u001b[0;30;48;2;166;216;84m\n",
      "\u001b[0m "
     ]
    }
   ],
   "source": [
    "# The official tokenizer library is `tiktoken`, but this is the same tokenizer hosted on the Hugging Face Hub\n",
    "show_tokens(text, \"Xenova/gpt-4\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 284,
     "referenced_widgets": [
      "770da5f1b8b24972b2018e1cadd3ec8a",
      "e2f255083a1b4d8f9992f27b4d21c676",
      "4fac840948a048d6b1ca7c0dc4f4c5d5",
      "c2ac158eb3f0469ca90f7247c546c70f",
      "4a8cfc4637124995866810ef1b750fe4",
      "409169ca000a484ca4472750cfe63f30",
      "8663795c551e457dafc93d02cf0026c3",
      "5639fe0c03e7451db316579356290e3d",
      "46e73023abe1465382771e9af87f36fc",
      "5feaeba2e88a4a7fb83a85f3200e2639",
      "76a8030af7794356b5c2daa891d789e2",
      "37c46ab78fc64eb98923b24d6a0de37e",
      "3bb4b5235ef74c7d89489f8fa8cded17",
      "a720bb387fca45968c75352398935382",
      "683b85afadb744e4bd7164c51f01d3f9",
      "00dd050102674a1ab3fd8d8f9caec4b0",
      "cf6a7c6ada024f8e9f106428d506b078",
      "33744d7c827e4784a91955159a47e337",
      "f6dce141c94d4c8494f75a7387b65331",
      "48738fb1cf8e4f0fb70b74a5896669cd",
      "1b10141545cb489fa3a58d4939cc4d9b",
      "99b9c874e58c4db9b596b6ca1699e666",
      "07bf43728198472997c8b59b9343adfe",
      "74d33e70d8af43148fd3a618b5d3c5dd",
      "8008a03780b24639abce64498b1d832e",
      "82ad72412e1343b983679e625c85f47d",
      "0d3aa270949048a5886de118b1a3b1f1",
      "cc568e7a8ca84810ab878e601fae557a",
      "cb57c7f3455f4a34b59ae39d0b599b8b",
      "577a22cb6c7549ff96f367bd6f4f8b12",
      "d0d4c92c9a0f4bd29255d8ff47d18c11",
      "e6d9b96a5cb9487d90136d097e716a5a",
      "e878edbee8ea48178b424e56417b7fa5",
      "e227f5f6bb3b4580b0ea4304d34ad556",
      "36863dd97aa04c48831d1fb455557adc",
      "ece59919873646f9bbf41c7547e802a3",
      "7ff1f54520324b3e9462062ecd87ce69",
      "2e8d55afb3fc4e2fa6b0b887a09b7ca9",
      "e527a040f0be4d43830ae6d4335771b0",
      "32f18ab0328146b5aac38b4c7ef8029d",
      "5d5d4e02a6724861aa36d9af5ea70ea7",
      "c8637134a894493093654456f2a9763b",
      "8a1023f076f34f34ab0b091d7f62c172",
      "3537e0361ab5475282b7f34adbfc70dc",
      "12d4df348cd34dc2b3d7ffc41f0561e8",
      "65326149d62e4404ba49a8d2d505adac",
      "b84c1bff2fcc48ed8fee636e1bdb16f9",
      "a8ae6ec72f2744b4991de0a961e1b142",
      "98182c58a56343e482ed44935be4fd31",
      "15719135708047a49a19887989dac12d",
      "a43ccd6d114644ae86c3129786f8105a",
      "90cc27a829e54b16850552e04f718bc3",
      "2be39bb4eacf41e381432b37febfd788",
      "df4dd594480d446da1523bd2d016c1cb",
      "11ea292f268d481396192b158814e6b1"
     ]
    },
    "executionInfo": {
     "elapsed": 9948,
     "status": "ok",
     "timestamp": 1719590292199,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": 240
    },
    "id": "3_vAyeTy5H7_",
    "outputId": "ad3f759f-19b7-4880-cbf8-9ed7cb25d627"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "770da5f1b8b24972b2018e1cadd3ec8a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/7.88k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "37c46ab78fc64eb98923b24d6a0de37e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "07bf43728198472997c8b59b9343adfe",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "merges.txt:   0%|          | 0.00/442k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e227f5f6bb3b4580b0ea4304d34ad556",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/2.06M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "12d4df348cd34dc2b3d7ffc41f0561e8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/958 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165m\n",
      "\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAPITAL\u001b[0m \u001b[0;30;48;2;166;216;84mIZATION\u001b[0m \u001b[0;30;48;2;255;217;47m\n",
      "\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m�\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m�\u001b[0m \u001b[0;30;48;2;255;217;47m�\u001b[0m \u001b[0;30;48;2;102;194;165m\n",
      "\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_\u001b[0m \u001b[0;30;48;2;231;138;195mtokens\u001b[0m \u001b[0;30;48;2;166;216;84m False\u001b[0m \u001b[0;30;48;2;255;217;47m None\u001b[0m \u001b[0;30;48;2;102;194;165m elif\u001b[0m \u001b[0;30;48;2;252;141;98m ==\u001b[0m \u001b[0;30;48;2;141;160;203m >=\u001b[0m \u001b[0;30;48;2;231;138;195m else\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47m two\u001b[0m \u001b[0;30;48;2;102;194;165m tabs\u001b[0m \u001b[0;30;48;2;252;141;98m:\"\u001b[0m \u001b[0;30;48;2;141;160;203m   \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m Three\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\u001b[0m \u001b[0;30;48;2;252;141;98m \"\u001b[0m \u001b[0;30;48;2;141;160;203m      \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m\n",
      "\u001b[0m \u001b[0;30;48;2;255;217;47m1\u001b[0m \u001b[0;30;48;2;102;194;165m2\u001b[0m \u001b[0;30;48;2;252;141;98m.\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m*\u001b[0m \u001b[0;30;48;2;166;216;84m5\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m6\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m0\u001b[0m \u001b[0;30;48;2;166;216;84m\n",
      "\u001b[0m "
     ]
    }
   ],
   "source": [
    "# You need to request access before you can use this tokenizer\n",
    "show_tokens(text, \"bigcode/starcoder2-15b\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 220,
     "referenced_widgets": [
      "6a6efb1d66ea423a9b5ff4b2f1f1194c",
      "1283ee793a40405aa9763e1b88d6d7a3",
      "11df3b517fd94524be18cff070e273a8",
      "86b01695df4b42d3a6602d704243b6ee",
      "b0b1039168a74b8fa79271592f29f0b7",
      "877ff6d25f524779a10df09be5fc6093",
      "276ec5fb636b49bcb933a4ce96cb900e",
      "ac88f9025a0e44b38fc14720d810b5ab",
      "4ceb8ce8b67a44b4b7505cd7e589dec1",
      "b0d45aec56fd4219b9224dbe31fad3a3",
      "bf3a8980a70547f5b853390235a37592",
      "f11120105af24fe1b40b6490897e2e2e",
      "8109bb974a3e4b12bd7534b62be20940",
      "f1d6a31870da4e27bc482bb84953b165",
      "1128f56169ac4376ab5fbb46d44b01dc",
      "3da38a2294a145bf86124d0fda8b3255",
      "85113d3b53fb47bb8593c3a21e37142c",
      "1e4a17f723d14694b5aeb673db7394cc",
      "87a4011d2d0a4076b027f7068b244dda",
      "48ca9047fc7d424f99b40c54e6d732f4",
      "3b169b44c1814ed5a7feba7bab0f3ce6",
      "dff4ee8d0bd74822a0adf204e21521b8",
      "5d3f3b08ec5044e3acf4414703e579d9",
      "900ddbabea1846a3a0dfd8380668bec1",
      "7da6e29f0349438494ff83975d48f02a",
      "01f1433d221f437eb0692d25948ce080",
      "9f189b7e32c94a3a84ffd40beed7d1fd",
      "3e5f406442df4b848d324ff584eea75c",
      "575b63bdd98047a4934d556423dd9ce6",
      "c9662398b8ad4d4fa1a59d44f6205769",
      "37379e486478437f9fa2f8eac7f9fd60",
      "2b83154cf934484da54fe0d0b08fe3d3",
      "759faaea712c4a8abb0ece5217e1c470"
     ]
    },
    "executionInfo": {
     "elapsed": 1388,
     "status": "ok",
     "timestamp": 1719589605088,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": 240
    },
    "id": "KeWcUdxY6I3u",
    "outputId": "f39c8f56-1e71-44bb-bade-75bfb33b581c"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6a6efb1d66ea423a9b5ff4b2f1f1194c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/166 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f11120105af24fe1b40b6490897e2e2e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/2.14M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5d3f3b08ec5044e3acf4414703e579d9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/3.00 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165m\n",
      "\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAP\u001b[0m \u001b[0;30;48;2;166;216;84mITAL\u001b[0m \u001b[0;30;48;2;255;217;47mIZATION\u001b[0m \u001b[0;30;48;2;102;194;165m\n",
      "\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m�\u001b[0m \u001b[0;30;48;2;231;138;195m�\u001b[0m \u001b[0;30;48;2;166;216;84m�\u001b[0m \u001b[0;30;48;2;255;217;47m �\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m\n",
      "\u001b[0m \u001b[0;30;48;2;231;138;195mshow\u001b[0m \u001b[0;30;48;2;166;216;84m_\u001b[0m \u001b[0;30;48;2;255;217;47mtokens\u001b[0m \u001b[0;30;48;2;102;194;165m False\u001b[0m \u001b[0;30;48;2;252;141;98m None\u001b[0m \u001b[0;30;48;2;141;160;203m elif\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m==\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m>\u001b[0m \u001b[0;30;48;2;252;141;98m=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m t\u001b[0m \u001b[0;30;48;2;102;194;165mabs\u001b[0m \u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203m\"\u001b[0m \u001b[0;30;48;2;231;138;195m    \u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m Three\u001b[0m \u001b[0;30;48;2;102;194;165m t\u001b[0m \u001b[0;30;48;2;252;141;98mabs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m       \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
      "\u001b[0m \u001b[0;30;48;2;141;160;203m1\u001b[0m \u001b[0;30;48;2;231;138;195m2\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m5\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m6\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m0\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
      "\u001b[0m "
     ]
    }
   ],
   "source": [
    "show_tokens(text, \"facebook/galactica-1.3b\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 374,
     "status": "ok",
     "timestamp": 1719589632350,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": 240
    },
    "id": "__QNj2Cohzz2",
    "outputId": "17ffab73-b07c-44a9-c482-64ab9f4c45a4"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165m<s>\u001b[0m \u001b[0;30;48;2;252;141;98m\u001b[0m \u001b[0;30;48;2;141;160;203m\n",
      "\u001b[0m \u001b[0;30;48;2;231;138;195mEnglish\u001b[0m \u001b[0;30;48;2;166;216;84mand\u001b[0m \u001b[0;30;48;2;255;217;47mC\u001b[0m \u001b[0;30;48;2;102;194;165mAP\u001b[0m \u001b[0;30;48;2;252;141;98mIT\u001b[0m \u001b[0;30;48;2;141;160;203mAL\u001b[0m \u001b[0;30;48;2;231;138;195mIZ\u001b[0m \u001b[0;30;48;2;166;216;84mATION\u001b[0m \u001b[0;30;48;2;255;217;47m\n",
      "\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m�\u001b[0m \u001b[0;30;48;2;231;138;195m�\u001b[0m \u001b[0;30;48;2;166;216;84m\u001b[0m \u001b[0;30;48;2;255;217;47m�\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m\n",
      "\u001b[0m \u001b[0;30;48;2;231;138;195mshow\u001b[0m \u001b[0;30;48;2;166;216;84m_\u001b[0m \u001b[0;30;48;2;255;217;47mto\u001b[0m \u001b[0;30;48;2;102;194;165mkens\u001b[0m \u001b[0;30;48;2;252;141;98mFalse\u001b[0m \u001b[0;30;48;2;141;160;203mNone\u001b[0m \u001b[0;30;48;2;231;138;195melif\u001b[0m \u001b[0;30;48;2;166;216;84m==\u001b[0m \u001b[0;30;48;2;255;217;47m>=\u001b[0m \u001b[0;30;48;2;102;194;165melse\u001b[0m \u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203mtwo\u001b[0m \u001b[0;30;48;2;231;138;195mtabs\u001b[0m \u001b[0;30;48;2;166;216;84m:\"\u001b[0m \u001b[0;30;48;2;255;217;47m  \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98mThree\u001b[0m \u001b[0;30;48;2;141;160;203mtabs\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m     \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
      "\u001b[0m \u001b[0;30;48;2;141;160;203m1\u001b[0m \u001b[0;30;48;2;231;138;195m2\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m5\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m6\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m0\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
      "\u001b[0m "
     ]
    }
   ],
   "source": [
    "show_tokens(text, \"microsoft/Phi-3-mini-4k-instruct\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9Tu7OY4HvBEm"
   },
   "source": [
    "# Contextualized Word Embeddings From a Language Model (Like BERT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 265,
     "referenced_widgets": [
      "761c3e6c7f26453bba2f463f39f3ae73",
      "a5c45eeafcf4456bbb0fe1bb43ef2497",
      "1f35082ee7ec425ea801321112d48db1",
      "618c1c5ae3ea4650a412a3e009d9fb49",
      "a158991ec46e4587aac2111e02153f4c",
      "51af2c28e34245ed83f02b424a6a640a",
      "56a95848a2814195b334c31f0a961cd8",
      "c1e09c1869f7410ba328e6efe56a0460",
      "17777d8eae4742aaa78d7854a52e102f",
      "5916af8bb7ed4efcb05c8f1cdf826149",
      "b1778299eac04302ac136872dfb0a359",
      "8ee98d9d017542609881a1e16a8f393f",
      "c4a8f08da3f64fb287f7a940c4a9408f",
      "d66ee44b7f5b4c6eac1536235a627441",
      "c44feccdd01949c1af40fa69f6e03dd0",
      "84682eb9cfff444da7633f4ca9360f77",
      "332810f458df4e23bc034852706bcc6f",
      "4999e7d8e2384bf4adfbf1777587a65f",
      "a8d4f19ff4554165a5e78d3783928fe1",
      "29d9e7a799ea402fb5a28b2817156838",
      "222d489bf1664763babf0e377e45f4d8",
      "2ae57535c39541fe98d6a8ae22bcd7d4",
      "fd134a05028c447a994166eccc557806",
      "69101d935ae841e59aa7f30e40789496",
      "3ef63f93e424409192edd1b1364aba48",
      "1d204aedfeb14df5ae7e27eb88a87018",
      "e74b57abfaa2487aa8e369103be5d00d",
      "94609e349c8b43b5b74d2c059623f9f0",
      "febce6a7c96e4a42a9c6faa0bf1763c0",
      "18a4603fa6a040c0acd792243510562a",
      "c04ce575bf624bd1a80113d1eff1ae94",
      "6a9de4aaed054608b800820831aec87f",
      "46cd179c1c09474a80dc4cea39b759d5",
      "20331d1d457143719fe732325e79877e",
      "2a5055c8fc03457eb29390312c555ec5",
      "6de0874e33c146bba06131aa452c403f",
      "0f1d2e4c312d4ab38e359f96c9a760b7",
      "f946c53a81f34d64b25e331fc4b4c7a1",
      "2dc553ef192c4002b818de6736367fe0",
      "665b4085199a4ec5890dba773f07d4b7",
      "de74a5af1ef9462e82e5622232346a79",
      "34df4fe808174f80a89a765f6ce2f28f",
      "56c082332310434fa5ec791b728fe82e",
      "9ef54c7a15d1400f91598b367ac6552e",
      "ed19532a8eda4925a4014a3f11517ff9",
      "3859f0311ade4d278190924f0107ba0e",
      "6ea963b64fd642b29021110d77827021",
      "e3efc5b43bf2417faaea2e44a306fa75",
      "4975826de21547fca434e1fb492a216a",
      "19b547738c8e45bda7879dc527229019",
      "6469ef3a7ba6465c911a22060d54ff95",
      "657b6ce5f0804bc78d64f9b7b6a27777",
      "301363b755a24bc9bf413fc3b0ffd8b2",
      "50fc50f975294a23a1d5bccf64efc872",
      "3c722f92f6c2479a91cf957d1d18fbee",
      "7570614d79184ab2b44700df2342b294",
      "543ae93b2c5f4339979b8c64eff33f62",
      "72b54fe7c88540488adee31c15595c89",
      "3fbdaadec9f545b99057f7c55c6a6df1",
      "7656d1b978554f4383160b93a80ee7c6",
      "4d55d194e62942df8f1c32b9bb244e9b",
      "403f03c3a2434fd2ac6f143c1972c62e",
      "cc68d46e46e7487484d366f77ea863ca",
      "3dee319ce0aa4ce589ce84033bba8d9d",
      "a280161408504bb894ab78526f67750b",
      "72d18d21ca2a45ac868d390ead3ac086"
     ]
    },
    "executionInfo": {
     "elapsed": 5049,
     "status": "ok",
     "timestamp": 1719641476949,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "nsjz-VsYu9bB",
    "outputId": "03ea124b-c6de-449d-ea6f-f5e5b84c2c97"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "761c3e6c7f26453bba2f463f39f3ae73",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8ee98d9d017542609881a1e16a8f393f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/474 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fd134a05028c447a994166eccc557806",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "20331d1d457143719fe732325e79877e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ed19532a8eda4925a4014a3f11517ff9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "7570614d79184ab2b44700df2342b294",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "pytorch_model.bin:   0%|          | 0.00/241M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
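   ],
   "source": [
    "from transformers import AutoModel, AutoTokenizer\n",
    "\n",
    "# Load a tokenizer.\n",
    "# Caution: deberta-base and deberta-v3-xsmall (loaded below) use different\n",
    "# vocabularies; in practice, load the tokenizer and model from the same checkpoint.\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"microsoft/deberta-base\")\n",
    "\n",
    "# Load a language model\n",
    "model = AutoModel.from_pretrained(\"microsoft/deberta-v3-xsmall\")\n",
    "\n",
    "# Tokenize the sentence and return PyTorch tensors\n",
    "tokens = tokenizer('Hello world', return_tensors='pt')\n",
    "\n",
    "# Run the tokens through the model and keep the first output,\n",
    "# the last hidden state (one contextualized vector per token)\n",
    "output = model(**tokens)[0]"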
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 567,
     "status": "ok",
     "timestamp": 1719641482036,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "lQly_KcbvDce",
    "outputId": "fe2cc467-2a5a-4111-8d23-4da9aa799b79"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 4, 384])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Shape: (batch size, number of tokens, embedding dimension)\n",
    "output.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 2,
     "status": "ok",
     "timestamp": 1719641482353,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "8GcRrpPV0kVj",
    "outputId": "93766ff1-1ae5-4e90-dba0-286d9e721c3d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[CLS]\n",
      "Hello\n",
      " world\n",
      "[SEP]\n"
     ]
    }
   ],
   "source": [
    "# Decode each token ID back to text to see how the sentence was split\n",
    "for token in tokens['input_ids'][0]:\n",
    "    print(tokenizer.decode(token))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 1,
     "status": "ok",
     "timestamp": 1719641482353,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "e8oHVC7B0lkk",
    "outputId": "f7dd1e0c-a2db-4ae4-8ccb-c97fa150071a"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[[-3.4816,  0.0861, -0.1819,  ..., -0.0612, -0.3911,  0.3017],\n",
       "         [ 0.1898,  0.3208, -0.2315,  ...,  0.3714,  0.2478,  0.8048],\n",
       "         [ 0.2071,  0.5036, -0.0485,  ...,  1.2175, -0.2292,  0.8582],\n",
       "         [-3.4278,  0.0645, -0.1427,  ...,  0.0658, -0.4367,  0.3834]]],\n",
       "       grad_fn=<NativeLayerNormBackward0>)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DdEDuLWa0r4L"
   },
   "source": [
    "# Text Embeddings (For Sentences and Whole Documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 425,
     "referenced_widgets": [
      "a50156f59d8548b683982af06d5bda09",
      "7cbfb80418f14068bb4327a3823140fb",
      "ef1e2b9a1e694eaa8d5b371caa66277e",
      "ebf8ef569e374c17aa15472ee6ea98d8",
      "75f5174b66c14f5480960c79a44283ad",
      "05504faf760d43ea9082ff5a91ff82f7",
      "50c6cf2be63b4be29524188d79e8cd53",
      "9088f66d46f44030af46aee55005a939",
      "7af424dbfd864b8ba6f7661f2204302c",
      "44aae8dd70da4be98fd86a970373753b",
      "49dd6490203640d1821196fe28a08732",
      "c40e76c5e9bd42bea7903e811bce53a4",
      "8f7d3eea82614e4ca93482ad7cb637a6",
      "05eb94a4253544648654f24268e8b6da",
      "01b9535a72c64c95b482640b2bd3c5fa",
      "b954ab8edd66487cb9a5502879f0e1c5",
      "e6e103b2c53a4b5ca71f988b7804aeaf",
      "77f25a7ee7f843b0a24a1d360a1df211",
      "a0e30058096941dcaab8f24235ee1c49",
      "b26ac1747c884e3a913d90b8c05a991b",
      "e5db8326887a48ae97341cf216d35893",
      "c76f0b056c1f40f3ac37df24c449e9ec",
      "c4c7962b94674cd6898980cea6483595",
      "7fc0e3e3c75e47f88410a2774695081e",
      "6c96c43dd36a4b08ad94a9cad642b1ab",
      "73b403ec39fa4a16883ad08a50c7204b",
      "6feee2b015ef4fa6b94c905ec91f5146",
      "ab68f3b41d954ec992b9339a91def6a8",
      "fb9054b0d27c4a8a9b8241f9d5910e51",
      "78b6d1098f754eb491a2a729e00d2335",
      "e84f055fc0c04c4983b162d1d8c67147",
      "13455551484542ea93d4dbbd937288d5",
      "99db2aef29ba4717866072f20f1acf61",
      "10f9441d843c44e5b11cfb2e21b5d89e",
      "5c987bfb44d14b1ea822c99cb7dde071",
      "52ac1a973f7140e8b49dbf58dc0c8b21",
      "37117814cae9440c9c54f63def546c4b",
      "e019010a35ea4f3a9b9236a626b34760",
      "ba03fe6450b742c99e8b8836f585232c",
      "47e9e2ed70c84023ae3a1dbf0cb27328",
      "99e53690e1c940cea12f581b625c3b3d",
      "f53653a9050042649df9d91114ba39bb",
      "737c25a604ee4bcd982487f79450d3ea",
      "cea92b3b96f24087b494487bf3f4c0f9",
      "784748de51254ba18128af8df30b8a93",
      "11890f81eceb41f7be6a2d52c9a9e55e",
      "f7adb025a07a4aca8b7a3a174304666f",
      "7f7a5bfc6073495da65dcfd4b2d49309",
      "fa6483d07ae54fb2904fc117aa9a3d5b",
      "2cbcf7d0b1384b8cb320e1b30c124d71",
      "94ea21e19acf4eb7a9010510b226db86",
      "f399e1a91e7d4b7eaddfb910bd81d750",
      "f26297a84f224c4da8afd4316c8e7477",
      "e4bef8778ddc46e5a8d0756eb27c3e7c",
      "66943967c327428a9796fcd38c36b24d",
      "f0793746dff34da8858758fb55284b97",
      "1bc7b31eddc54588979f3fff14a0e12e",
      "75f54f7a8e5b4965b6f0ad28e5f3bf26",
      "3cb9178f0568448fa839d5cccc7973d7",
      "40bde45ac20f48ab93c4fd9e8284eac8",
      "09b701e83e3844fab97ff237b06a1238",
      "c1bbe572b8324ea48d42c40a5128bb8e",
      "a4bd81ed4d9d498a983c13ea79265819",
      "d8d65b5ac8914792b82460cf0bae980d",
      "136cd465bac246f2ac2454eea2f0484d",
      "59030392bbde468aae6c62aecddd499e",
      "ca7eb54b296c4a1fa91678c0e3d65f5d",
      "35521a33c1324a928fd2c9f7fce2ce69",
      "80c82a58f0924a578bcae9d3c6537c11",
      "589d24a92b3d4e5c99460cd609c8a230",
      "0edd8bfe5bba47d59c5b195841cc4228",
      "2aed08160bef4a7189821c560c01d6bb",
      "01ca9f66804048a9a75475ab9c49a24e",
      "880c7bdcf3174a78892aa7d0cd11dca7",
      "1d4f85ce80d841eab27b411b3e61e9be",
      "9960dd3bff70458abc148f1a153175ec",
      "0bdc012004ad445fa527fafee0ab55c2",
      "2895fad80e754b4e8158c6dd8db69058",
      "4a2c467901414bf0afc5b310aa959dae",
      "a465317d0f3b4106bbf8fd6c7a3caf6a",
      "7c93b5df16a64e1981e350459b05852d",
      "66d5c0087bd141b3baa502e3aa8bd408",
      "d26ec2f29c154da1ac7cb49cb9729113",
      "7385eea9ed2a438c8dae350fe2328162",
      "c861634df7524e7cbd7fdca030a0b663",
      "9904b80c37c44ed2a6b3c21786016e26",
      "df424fcea3e84c8084dbf8e146d1231f",
      "a1e116cf62d74d4e8e33c99379e924ed",
      "652ddbc085994a36b553ea04359943a1",
      "b340ba4de77043dcbedbcbdf6033d0c7",
      "10c89678b42b4cf0b8e95b74ad346fba",
      "fb22220311094fe2b3d245ff080ca4d7",
      "fc1358383bba4e5ebe2edfca57473002",
      "342adaf9c12548a4af25f5361ca869ca",
      "374287b3d21a427fabd82dc1e0710d62",
      "427e31cb6fcc4687937d807116e5e581",
      "8195fa9ad4e3487c90cbc0860361a336",
      "82bf559e6997425aba4245c44531f762",
      "6bc8887963d744a7a6c15930844f5513",
      "4688efe8ab954510b30df18f8daa74a5",
      "27c697f872b64fb5a43deb255e27fb50",
      "08f2dd1dd0a742eb8c9e11a41dbe69a3",
      "94b1fee85a034b49aba0c50fcdfc0fdb",
      "c21e510111f5425bad28afd9b723f9d1",
      "fea69d561fb94e8c98af4526f8d4b33e",
      "0cec3fa3672b4bbaa20e3d4aae6fd575",
      "2b09f8d112ea43a5b60c010f2bf0bbdb",
      "60ea33df89a449ad9e0f90a0bca672ad",
      "48230299285241a28e2c8e03db6fce4d",
      "3229d1aa3fe74299a1d4e3f917cc2ca6",
      "1f5849878621437397efe2de7b7a43fe",
      "2a554904b31c4ed8a4269d2c26ea4e91",
      "a29c289ad83d4c718d5854e5d3eff48a",
      "07584d1f220f4fb6b82a42090f3818f7",
      "04e248de88ac4a50ad20272d31549304",
      "57bb0b0d061d4e778493adb47482f234",
      "2a24003cc0d54e568706cc6fc77d2831",
      "849583682f034d3d8b8887cbafb3daaf",
      "8ab0464a810e4b208d4c0fc481c58b54",
      "cab50f86e61d43df90643acaf98670ab",
      "d4180478ac134757bcfe6c4f0ff4990e"
     ]
    },
    "executionInfo": {
     "elapsed": 7006,
     "status": "ok",
     "timestamp": 1719641491724,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "TQHWioIc0pQ8",
    "outputId": "87112ec7-bee0-4894-d850-8dd5e0f4e38c"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a50156f59d8548b683982af06d5bda09",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c40e76c5e9bd42bea7903e811bce53a4",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c4c7962b94674cd6898980cea6483595",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "10f9441d843c44e5b11cfb2e21b5d89e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "784748de51254ba18128af8df30b8a93",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f0793746dff34da8858758fb55284b97",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ca7eb54b296c4a1fa91678c0e3d65f5d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2895fad80e754b4e8158c6dd8db69058",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "652ddbc085994a36b553ea04359943a1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4688efe8ab954510b30df18f8daa74a5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1f5849878621437397efe2de7b7a43fe",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "\n",
    "# Load a pretrained sentence embedding model\n",
    "model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')\n",
    "\n",
    "# Convert the text to a single embedding vector\n",
    "vector = model.encode(\"Best movie ever!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 2,
     "status": "ok",
     "timestamp": 1719641491724,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "PDwfmBiC0uER",
    "outputId": "db6755ce-92b2-45d1-85aa-9b53baee446e"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(768,)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vector.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xnuGRjo80yKj"
   },
   "source": [
    "# Word Embeddings Beyond LLMs\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 44634,
     "status": "ok",
     "timestamp": 1719641543423,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "sKgNdnwe0vfK",
    "outputId": "180bbb09-b030-4fa0-9198-085b0eb54c7b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[==================================================] 100.0% 66.0/66.0MB downloaded\n"
     ]
    }
   ],
   "source": [
    "import gensim.downloader as api\n",
    "\n",
    "# Download embeddings (66 MB, GloVe, trained on Wikipedia and Gigaword, vector size: 50)\n",
    "# Other options include \"word2vec-google-news-300\"\n",
    "# More options at https://github.com/RaRe-Technologies/gensim-data\n",
    "model = api.load(\"glove-wiki-gigaword-50\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 2,
     "status": "ok",
     "timestamp": 1719641543423,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "u_vj5NVn01aD",
    "outputId": "73c3edd8-0185-494d-a842-d78cbe100642"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('king', 1.0000001192092896),\n",
       " ('prince', 0.8236179351806641),\n",
       " ('queen', 0.7839043140411377),\n",
       " ('ii', 0.7746230363845825),\n",
       " ('emperor', 0.7736247777938843),\n",
       " ('son', 0.766719400882721),\n",
       " ('uncle', 0.7627150416374207),\n",
       " ('kingdom', 0.7542161345481873),\n",
       " ('throne', 0.7539914846420288),\n",
       " ('brother', 0.7492411136627197),\n",
       " ('ruler', 0.7434253692626953)]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Find the words closest to 'king'; the query word itself ranks first\n",
    "model.most_similar([model['king']], topn=11)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QMSgyKKS4xUx"
   },
   "source": [
    "# Recommending Songs by Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3dJdWzT67nDL"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from urllib import request\n",
    "\n",
    "# Get the playlist dataset file\n",
    "data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')\n",
    "\n",
    "# Parse the playlist dataset file. Skip the first two lines as\n",
    "# they only contain metadata\n",
    "lines = data.read().decode(\"utf-8\").split('\\n')[2:]\n",
    "\n",
    "# Keep only playlists with more than one song\n",
    "playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]\n",
    "\n",
    "# Load song metadata (one tab-separated line per song: id, title, artist)\n",
    "songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')\n",
    "songs_file = songs_file.read().decode(\"utf-8\").split('\\n')\n",
    "songs = [s.rstrip().split('\\t') for s in songs_file]\n",
    "songs_df = pd.DataFrame(data=songs, columns=['id', 'title', 'artist'])\n",
    "songs_df = songs_df.set_index('id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 3,
     "status": "ok",
     "timestamp": 1724598630488,
     "user": {
      "displayName": "Jay Alammar جهاد العمار",
      "userId": "14617748739431919458"
     },
     "user_tz": 240
    },
    "id": "Q3zirG-lo3H8",
    "outputId": "e3b4269e-dd42-428e-8b28-46c27d0231af"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Playlist #1:\n",
      "  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] \n",
      "\n",
      "Playlist #2:\n",
      "  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117', '118', '119', '120', '121', '122', '123', '50', '70', '71', '124', '17', '85', '14', '82', '48', '125', '47', '46', '72', '53', '25', '73', '4', '126', '59', '74', '20', '43', '127', '128', '129', '13', '82', '48', '130', '131', '132', '133', '134', '135', '136', '137', '59', '46', '138', '43', '20', '139', '140', '73', '57', '70', '141', '3', '1', '74', '142', '143', '144', '145', '48', '13', '25', '146', '50', '147', '126', '59', '20', '148', '149', '150', '151', '152', '56', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '60', '176', '51', '177', '178', '179', '180', '181', '182', '183', '184', '185', '57', '186', '187', '188', '189', '190', '191', '46', '192', '193', '194', '195', '196', '197', '198', '25', '199', '200', '49', '201', '100', '202', '203', '204', '205', '206', '207', '32', '208', '209', '210']\n"
     ]
    }
   ],
   "source": [
    "print('Playlist #1:\\n ', playlists[0], '\\n')\n",
    "print('Playlist #2:\\n ', playlists[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "EaUz3E0P7sJs"
   },
   "outputs": [],
   "source": [
    "from gensim.models import Word2Vec\n",
    "\n",
    "# Train a Word2Vec model, treating each playlist as a sentence\n",
    "# and each song ID as a word\n",
    "model = Word2Vec(\n",
    "    playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 314,
     "status": "ok",
     "timestamp": 1719642095066,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "9EFGWesO8rOJ",
    "outputId": "1e46ce56-7b14-4268-a38a-c328e0f52943"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('2849', 0.9979680776596069),\n",
       " ('2640', 0.9964019060134888),\n",
       " ('3167', 0.9963980317115784),\n",
       " ('5549', 0.9959008693695068),\n",
       " ('2715', 0.9958351850509644),\n",
       " ('3117', 0.9954560995101929),\n",
       " ('2987', 0.9953479766845703),\n",
       " ('2881', 0.9951083660125732),\n",
       " ('2886', 0.9950577616691589),\n",
       " ('3094', 0.994985044002533)]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "song_id = 2172\n",
    "\n",
    "# Ask the model for songs similar to song #2172\n",
    "model.wv.most_similar(positive=str(song_id))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 321,
     "status": "ok",
     "timestamp": 1719642762615,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "AMiY6isXqKk4",
    "outputId": "0f465f20-ada8-4fa8-92d6-f72966d03aa4"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "title     Fade To Black\n",
      "artist        Metallica\n",
      "Name: 2172 , dtype: object\n"
     ]
    }
   ],
   "source": [
    "# Look up the title and artist of song #2172\n",
    "print(songs_df.iloc[2172])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 237
    },
    "executionInfo": {
     "elapsed": 556,
     "status": "ok",
     "timestamp": 1719642918281,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "aOzWENxr2Fl3",
    "outputId": "0b1ac29a-14f7-4e30-e153-e8f35ca97d7e"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \"print_recommendations(2172)\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"2640 \",\n          \"2715 \",\n          \"3167 \"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Red Barchetta\",\n          \"Rainbow In The Dark\",\n          \"Unchained\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"artist\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Rush\",\n          \"Dio\",\n          \"Van Halen\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-94b64d84-06f0-49f5-a721-a51ab661e5c4\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>artist</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2849</th>\n",
       "      <td>Run To The Hills</td>\n",
       "      <td>Iron Maiden</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2640</th>\n",
       "      <td>Red Barchetta</td>\n",
       "      <td>Rush</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3167</th>\n",
       "      <td>Unchained</td>\n",
       "      <td>Van Halen</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5549</th>\n",
       "      <td>November Rain</td>\n",
       "      <td>Guns N' Roses</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2715</th>\n",
       "      <td>Rainbow In The Dark</td>\n",
       "      <td>Dio</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "    <div class=\"colab-df-buttons\">\n",
       "\n",
       "  <div class=\"colab-df-container\">\n",
       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-94b64d84-06f0-49f5-a721-a51ab661e5c4')\"\n",
       "            title=\"Convert this dataframe to an interactive table.\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
       "  </svg>\n",
       "    </button>\n",
       "\n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    .colab-df-buttons div {\n",
       "      margin-bottom: 4px;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "    <script>\n",
       "      const buttonEl =\n",
       "        document.querySelector('#df-94b64d84-06f0-49f5-a721-a51ab661e5c4 button.colab-df-convert');\n",
       "      buttonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "      async function convertToInteractive(key) {\n",
       "        const element = document.querySelector('#df-94b64d84-06f0-49f5-a721-a51ab661e5c4');\n",
       "        const dataTable =\n",
       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                    [key], {});\n",
       "        if (!dataTable) return;\n",
       "\n",
       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "          + ' to learn more about interactive tables.';\n",
       "        element.innerHTML = '';\n",
       "        dataTable['output_type'] = 'display_data';\n",
       "        await google.colab.output.renderOutput(dataTable, element);\n",
       "        const docLink = document.createElement('div');\n",
       "        docLink.innerHTML = docLinkHtml;\n",
       "        element.appendChild(docLink);\n",
       "      }\n",
       "    </script>\n",
       "  </div>\n",
       "\n",
       "\n",
       "<div id=\"df-66b2f5cc-45c9-44ee-b044-69f0e59b123a\">\n",
       "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-66b2f5cc-45c9-44ee-b044-69f0e59b123a')\"\n",
       "            title=\"Suggest charts\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "     width=\"24px\">\n",
       "    <g>\n",
       "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
       "    </g>\n",
       "</svg>\n",
       "  </button>\n",
       "\n",
       "<style>\n",
       "  .colab-df-quickchart {\n",
       "      --bg-color: #E8F0FE;\n",
       "      --fill-color: #1967D2;\n",
       "      --hover-bg-color: #E2EBFA;\n",
       "      --hover-fill-color: #174EA6;\n",
       "      --disabled-fill-color: #AAA;\n",
       "      --disabled-bg-color: #DDD;\n",
       "  }\n",
       "\n",
       "  [theme=dark] .colab-df-quickchart {\n",
       "      --bg-color: #3B4455;\n",
       "      --fill-color: #D2E3FC;\n",
       "      --hover-bg-color: #434B5C;\n",
       "      --hover-fill-color: #FFFFFF;\n",
       "      --disabled-bg-color: #3B4455;\n",
       "      --disabled-fill-color: #666;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart {\n",
       "    background-color: var(--bg-color);\n",
       "    border: none;\n",
       "    border-radius: 50%;\n",
       "    cursor: pointer;\n",
       "    display: none;\n",
       "    fill: var(--fill-color);\n",
       "    height: 32px;\n",
       "    padding: 0;\n",
       "    width: 32px;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart:hover {\n",
       "    background-color: var(--hover-bg-color);\n",
       "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "    fill: var(--button-hover-fill-color);\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart-complete:disabled,\n",
       "  .colab-df-quickchart-complete:disabled:hover {\n",
       "    background-color: var(--disabled-bg-color);\n",
       "    fill: var(--disabled-fill-color);\n",
       "    box-shadow: none;\n",
       "  }\n",
       "\n",
       "  .colab-df-spinner {\n",
       "    border: 2px solid var(--fill-color);\n",
       "    border-color: transparent;\n",
       "    border-bottom-color: var(--fill-color);\n",
       "    animation:\n",
       "      spin 1s steps(1) infinite;\n",
       "  }\n",
       "\n",
       "  @keyframes spin {\n",
       "    0% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "      border-left-color: var(--fill-color);\n",
       "    }\n",
       "    20% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    30% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    40% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    60% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    80% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "    90% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "  }\n",
       "</style>\n",
       "\n",
       "  <script>\n",
       "    async function quickchart(key) {\n",
       "      const quickchartButtonEl =\n",
       "        document.querySelector('#' + key + ' button');\n",
       "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
       "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
       "      try {\n",
       "        const charts = await google.colab.kernel.invokeFunction(\n",
       "            'suggestCharts', [key], {});\n",
       "      } catch (error) {\n",
       "        console.error('Error during call to suggestCharts:', error);\n",
       "      }\n",
       "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
       "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
       "    }\n",
       "    (() => {\n",
       "      let quickchartButtonEl =\n",
       "        document.querySelector('#df-66b2f5cc-45c9-44ee-b044-69f0e59b123a button');\n",
       "      quickchartButtonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "    })();\n",
       "  </script>\n",
       "</div>\n",
       "\n",
       "    </div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                     title         artist\n",
       "id                                       \n",
       "2849      Run To The Hills    Iron Maiden\n",
       "2640         Red Barchetta           Rush\n",
       "3167             Unchained      Van Halen\n",
       "5549         November Rain  Guns N' Roses\n",
       "2715   Rainbow In The Dark            Dio"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
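   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "def print_recommendations(song_id):\n",
    "    # most_similar returns (song_id, similarity) pairs; keep only the ids\n",
    "    similar_songs = np.array(\n",
    "        model.wv.most_similar(positive=str(song_id), topn=5)\n",
    "    )[:, 0]\n",
    "    # The ids are strings, so cast them to integer row positions for iloc\n",
    "    return songs_df.iloc[similar_songs.astype(int)]\n",
    "\n",
    "# Extract recommendations\n",
    "print_recommendations(2172)"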
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 310
    },
    "executionInfo": {
     "elapsed": 681,
     "status": "ok",
     "timestamp": 1719642181255,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "xqrzQQ-m1EJ5",
    "outputId": "3cf4967d-f510-4772-cb11-4166d16c6956"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "title     Fade To Black\n",
      "artist        Metallica\n",
      "Name: 2172 , dtype: object\n",
      "['2849' '2640' '3167' '5549' '2715']\n"
     ]
    },
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \"print_recommendations(2172)\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"2640 \",\n          \"2715 \",\n          \"3167 \"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Red Barchetta\",\n          \"Rainbow In The Dark\",\n          \"Unchained\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"artist\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Rush\",\n          \"Dio\",\n          \"Van Halen\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-c38e0eb4-9c39-45f5-aa32-dbd65ad89576\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>artist</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2849</th>\n",
       "      <td>Run To The Hills</td>\n",
       "      <td>Iron Maiden</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2640</th>\n",
       "      <td>Red Barchetta</td>\n",
       "      <td>Rush</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3167</th>\n",
       "      <td>Unchained</td>\n",
       "      <td>Van Halen</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5549</th>\n",
       "      <td>November Rain</td>\n",
       "      <td>Guns N' Roses</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2715</th>\n",
       "      <td>Rainbow In The Dark</td>\n",
       "      <td>Dio</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                     title         artist\n",
       "id                                       \n",
       "2849      Run To The Hills    Iron Maiden\n",
       "2640         Red Barchetta           Rush\n",
       "3167             Unchained      Van Halen\n",
       "5549         November Rain  Guns N' Roses\n",
       "2715   Rainbow In The Dark            Dio"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print_recommendations(2172)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 310
    },
    "executionInfo": {
     "elapsed": 316,
     "status": "ok",
     "timestamp": 1719642205517,
     "user": {
      "displayName": "Maarten Grootendorst",
      "userId": "11015108362723620659"
     },
     "user_tz": -120
    },
    "id": "TIHiN62g1NMi",
    "outputId": "c548f528-6e2e-4a46-89e0-6599395d6419"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "title     California Love (w\\/ Dr. Dre & Roger Troutman)\n",
      "artist                                              2Pac\n",
      "Name: 842 , dtype: object\n",
      "['5668' '413' '5661' '330' '886']\n"
     ]
    },
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \"print_recommendations(842)\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"413 \",\n          \"886 \",\n          \"5661 \"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"If I Ruled The World (Imagine That) (w\\\\/ Lauryn Hill)\",\n          \"Heartless\",\n          \"Sweet Dreams\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"artist\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 4,\n        \"samples\": [\n          \"Nas\",\n          \"Kanye West\",\n          \"The Game\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-1afa899d-2db1-434a-a095-9b7ade3d2589\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>artist</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5668</th>\n",
       "      <td>How We Do (w\\/ 50 Cent)</td>\n",
       "      <td>The Game</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>If I Ruled The World (Imagine That) (w\\/ Laury...</td>\n",
       "      <td>Nas</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5661</th>\n",
       "      <td>Sweet Dreams</td>\n",
       "      <td>Beyonce</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>330</th>\n",
       "      <td>Hate It Or Love It (w\\/ 50 Cent)</td>\n",
       "      <td>The Game</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>886</th>\n",
       "      <td>Heartless</td>\n",
       "      <td>Kanye West</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                                   title      artist\n",
       "id                                                                  \n",
       "5668                             How We Do (w\\/ 50 Cent)    The Game\n",
       "413    If I Ruled The World (Imagine That) (w\\/ Laury...         Nas\n",
       "5661                                        Sweet Dreams     Beyonce\n",
       "330                     Hate It Or Love It (w\\/ 50 Cent)    The Game\n",
       "886                                            Heartless  Kanye West"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print_recommendations(842)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
